MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 

and 

CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING 
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES 

A.I. Memo No. 1681 November 1999 

C.B.C.L Paper No. 184 

A note on the generalization performance of kernel classifiers with margin.

Theodoros Evgeniou and Massimiliano Pontil 

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. 

The pathname for this publication is: ai-publications/1500-1999/AIM-1681.ps 

Abstract 

We present distribution independent bounds on the generalization misclassification performance of a family of kernel classifiers with margin. Support Vector Machine classifiers (SVM) stem out of this class of machines. The bounds are derived through computations of the $V_\gamma$ dimension of a family of loss functions to which the SVM loss belongs. Bounds that use functions of margin distributions (i.e. functions of the slack variables of SVM) are derived.

1 Introduction


Deriving bounds on the generalization performance of kernel classifiers has been an important theoretical topic of research in recent years [4, 8, 9, 10, 12]. We present new bounds on the generalization performance of a family of kernel classifiers with margin, from which Support Vector Machines (SVM) can be derived. The bounds use the $V_\gamma$ dimension of a class of loss functions, to which the SVM loss belongs, and functions of the margin distribution of the machines (i.e. functions of the slack variables of SVM, see below).

We consider classification machines of the form: 

$$\min_{f} \; \sum_{i=1}^{m} V(y_i, f(\mathbf{x}_i)) \quad \text{subject to} \quad \|f\|_K^2 \le A^2 \qquad (1)$$

where we use the following notation: 

• $D_m = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}$, with $(\mathbf{x}_i, y_i) \in R^n \times \{-1, 1\}$ sampled according to an unknown probability distribution $P(\mathbf{x}, y)$, is the training set.

• $V(y, f(\mathbf{x}))$ is the loss function measuring the distance (error) between $f(\mathbf{x})$ and $y$.

• $f$ is a function in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ defined by kernel $K$, with $\|f\|_K$ being the norm of $f$ in $\mathcal{H}$ [11, 2]. We also call $f$ a hyperplane, since it is such in the feature space induced by the kernel $K$ [11, 10].

• A is a constant. 

Classification of a new test point $\mathbf{x}$ is always done by simply considering the sign of $f(\mathbf{x})$. Machines of this form have been motivated in the framework of statistical learning theory. We refer the reader to [10, 6, 3] for more details. In this paper we study the generalization performance of these machines for choices of the loss function $V$ that are relevant for classification. In particular we consider the following loss functions:

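• Misclassification loss function:

$$V(y, f(\mathbf{x})) = V^{msc}(y f(\mathbf{x})) = \theta(-y f(\mathbf{x})) \qquad (2)$$

• Hard margin loss function:

$$V(y, f(\mathbf{x})) = V^{hm}(y f(\mathbf{x})) = \theta(1 - y f(\mathbf{x})) \qquad (3)$$

• Soft margin loss function:

$$V(y, f(\mathbf{x})) = V^{sm}(y f(\mathbf{x})) = \theta(1 - y f(\mathbf{x}))(1 - y f(\mathbf{x})), \qquad (4)$$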

where $\theta$ is the Heaviside function. Loss functions (3) and (4) are “margin” ones because the only case in which they do not penalize a point $(\mathbf{x}, y)$ is if $y f(\mathbf{x}) \ge 1$. For a given $f$, these are the points that are correctly classified and have distance $|f(\mathbf{x})| / \|f\|_K \ge 1 / \|f\|_K$ from the surface $f(\mathbf{x}) = 0$ (a hyperplane in the feature space induced by the kernel $K$ [10]). For a point $(\mathbf{x}, y)$, the quantity $y f(\mathbf{x}) / \|f\|_K$ is its margin, and the probability of having $y f(\mathbf{x}) / \|f\|_K \ge \delta$ is called the margin distribution of hypothesis $f$. For SVM, the quantity $\theta(1 - y_i f(\mathbf{x}_i))(1 - y_i f(\mathbf{x}_i))$ is known as the slack variable corresponding to training point $(\mathbf{x}_i, y_i)$ [10].

Figure 1: Hard margin loss (line with diamond-shaped points), soft margin loss (solid line), nonlinear soft margin with $\sigma = 2$ (line with crosses), and $\sigma = 1/2$ (dotted line).

We will also consider the following family of margin loss functions (nonlinear soft margin loss 
functions): 

$$V(y, f(\mathbf{x})) = V^{\sigma}(y f(\mathbf{x})) = \theta(1 - y f(\mathbf{x}))(1 - y f(\mathbf{x}))^{\sigma}. \qquad (5)$$

Loss functions (3) and (4) correspond to the choice of $\sigma = 0, 1$ respectively. In figure 1 we plot some of the possible loss functions for different choices of the parameter $\sigma$.
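
To make the loss functions (2)-(5) concrete, the following minimal sketch evaluates them on a single point. The function names and the convention $\theta(0) = 0$ are assumptions made for illustration; the text does not fix the value of the Heaviside function at zero.

```python
def theta(z):
    """Heaviside step; theta(0) = 0 is an assumed convention."""
    return 1.0 if z > 0 else 0.0

def v_msc(y, fx):
    """Misclassification loss (2): theta(-y f(x))."""
    return theta(-y * fx)

def v_hm(y, fx):
    """Hard margin loss (3): theta(1 - y f(x))."""
    return theta(1.0 - y * fx)

def v_sigma(y, fx, sigma=1.0):
    """Nonlinear soft margin loss (5); sigma = 1 gives the soft margin loss (4)."""
    z = 1.0 - y * fx
    return z ** sigma if z > 0 else 0.0

# A correctly classified point that lies inside the margin (y f(x) = 0.5):
# it is not a misclassification, but every margin loss penalizes it.
for sigma in (0.0, 0.5, 1.0, 2.0):
    print(sigma, v_msc(1, 0.5), v_hm(1, 0.5), v_sigma(1, 0.5, sigma))
```

With $\sigma = 0$ the sketch returns the same values as the hard margin loss, which matches the remark that (3) corresponds to $\sigma = 0$.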


To study the statistical properties of machines (1) we use some well known results that we now 
briefly present. First we define some more notation, and then state the results from the literature 
that we will use in the next section. 

We use the following notation: 

• $R^V_{emp}(f) = \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(\mathbf{x}_i))$ is the empirical error made by $f$ on the training set $D_m$, using $V$ as the loss function.

• $R^V(f) = \int_{R^n \times \{-1,1\}} V(y, f(\mathbf{x})) \, P(\mathbf{x}, y) \, d\mathbf{x} \, dy$ is the expected error of $f$ using $V$ as the loss function.

• Given a hypothesis space of functions $\mathcal{F}$ (i.e. $\mathcal{F} = \{f \in \mathcal{H} : \|f\|_K^2 \le A^2\}$), we denote by $h^{\mathcal{F}}_{\gamma}$ the $V_\gamma$ dimension of the loss function $V(y, f(\mathbf{x}))$ in $\mathcal{F}$, which is defined as follows [1]:

Definition 1.1 Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, with $A$ and $B < \infty$. The $V_\gamma$-dimension of $V$ in $\mathcal{F}$ (of the set of functions $\{V(y, f(\mathbf{x})) \mid f \in \mathcal{F}\}$) is defined as the maximum number $h$ of vectors $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_h, y_h)$ that can be separated into two classes in all $2^h$ possible ways using rules:

class 1 if: $V(y_i, f(\mathbf{x}_i)) \ge s + \gamma$
class -1 if: $V(y_i, f(\mathbf{x}_i)) \le s - \gamma$

for $f \in \mathcal{F}$ and some $s \ge 0$. If, for any number $m$, it is possible to find $m$ points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$ that can be separated in all the $2^m$ possible ways, we will say that the $V_\gamma$-dimension of $V$ in $\mathcal{F}$ is infinite.

If instead of a fixed $s$ for all points we use a different $s_i$ for each $(\mathbf{x}_i, y_i)$, we get what is called the fat-shattering dimension $fat_\gamma$ [1]. Notice that definition (1.1) includes the special case in which we directly measure the $V_\gamma$ dimension of the space of functions $\mathcal{F}$, i.e. $V(y, f(\mathbf{x})) = f(\mathbf{x})$. We will need such a quantity in theorem 2.2 below.
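
For a finite family of functions, the separation rules of definition 1.1 can be checked by brute force. The sketch below is only an illustration under stated assumptions: the helper `is_gamma_shattered`, the finite candidate list for $s$, and the toy family $f(x) = wx$ are hypothetical and do not appear in the literature cited here.

```python
from itertools import product

def is_gamma_shattered(points, functions, loss, gamma, s_values):
    """Brute-force check of definition 1.1 on a finite function family.

    Returns True if, for one common threshold s taken from s_values, every
    assignment of the points to classes +1/-1 is realized by some f with
    loss >= s + gamma on class 1 and loss <= s - gamma on class -1.
    """
    for s in s_values:
        all_realized = True
        for labelling in product((1, -1), repeat=len(points)):
            realized = any(
                all(
                    (loss(y, f(x)) >= s + gamma) if c == 1
                    else (loss(y, f(x)) <= s - gamma)
                    for (x, y), c in zip(points, labelling)
                )
                for f in functions
            )
            if not realized:
                all_realized = False
                break
        if all_realized:
            return True
    return False

def soft_margin(y, fx):
    """Soft margin loss (4)."""
    return max(0.0, 1.0 - y * fx)

# Hypothetical toy family: f(x) = w * x on the real line with |w| <= 1.
functions = [lambda x, w=w: w * x for w in (-1.0, -0.5, 0.0, 0.5, 1.0)]
points = [(1.0, 1)]

print(is_gamma_shattered(points, functions, soft_margin, 0.5, [1.0]))
print(is_gamma_shattered(points, functions, soft_margin, 1.5, [1.0, 1.5, 2.0]))
```

On this toy family the loss at the single point only takes values in $[0, 2]$, so the point can be separated with $\gamma = 0.5$ but not with $\gamma = 1.5$: a larger scale $\gamma$ can only decrease the dimension.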

Using the $V_\gamma$ dimension we can study the statistical properties of machines of the form (1) based on a standard theorem that characterizes the generalization performance of these machines.

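Theorem 1.1 (Alon et al., 1993) Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded functions. For any $\epsilon > 0$, for all $m \ge \frac{2}{\epsilon^2}$ we have that if $h^{\mathcal{F}}_{\gamma}$ is the $V_\gamma$ dimension of $V$ in $\mathcal{F}$ for $\gamma = \alpha\epsilon$ ($\alpha \ge \frac{1}{48}$), $h^{\mathcal{F}}_{\gamma}$ finite, then:

$$\Pr\left\{ \sup_{f \in \mathcal{F}} \left| R^V_{emp}(f) - R^V(f) \right| > \epsilon \right\} \le \mathcal{G}(\epsilon, m, h^{\mathcal{F}}_{\gamma}), \qquad (6)$$

where $\mathcal{G}$ is an increasing function of $h^{\mathcal{F}}_{\gamma}$ and a decreasing function of $\epsilon$ and $m$, with $\mathcal{G} \to 0$ as $m \to \infty$.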

In [1] the fat-shattering dimension was used, but a close relation between that and the $V_\gamma$ dimension [1] makes the two equivalent for our purpose¹. Closed forms of $\mathcal{G}$ can be derived (see for example [1]) but we do not present them here for simplicity of notation. Notice that since we are interested in classification, we only consider $\epsilon < 1$, so we will only discuss the case $\gamma < 1$ (since $\gamma$ is about $\frac{1}{48}\epsilon$).

In “standard” statistical learning theory the VC dimension is used instead of the $V_\gamma$ one [10]. However, for the type of machines we are interested in the VC dimension turns out not to be appropriate: it is not influenced by the choice of the hypothesis space $\mathcal{F}$ through the choice of $A$, and in the case that $\mathcal{F}$ is an infinite dimensional RKHS, the VC-dimension of the loss functions we consider turns out to be infinite (see for example [5]). Instead, scale-sensitive dimensions (such as the $V_\gamma$ or fat-shattering one [1]) have been used in the literature, as we will discuss in the last section.


2 Main results 

We study the loss functions (2)-(5). For classification machines the quantity we are interested in is the expected misclassification error of the solution $f$ of problem (1). With some abuse of notation we denote this by $R^{msc}$. Similarly we denote by $R^{hm}$, $R^{sm}$, and $R^{\sigma}$ the expected risks using loss functions (3), (4) and (5), respectively, and by $R^{hm}_{emp}$, $R^{sm}_{emp}$, and $R^{\sigma}_{emp}$ the corresponding empirical errors. We will not consider machines of type (1) with $V^{msc}$ as the loss function, for a clear reason: the solution of the optimization problem:

$$\min_{f} \; \sum_{i=1}^{m} \theta(-y_i f(\mathbf{x}_i)) \quad \text{subject to} \quad \|f\|_K^2 \le A^2$$

is independent of $A$, since for any solution $f$ we can always rescale $f$ and have the same cost $\sum_{i=1}^{m} \theta(-y_i f(\mathbf{x}_i))$.

¹ In [1] it is shown that $V_\gamma \le fat_\gamma \le \frac{1}{\gamma} V_{\gamma/2}$.

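For machines of type (1) that use $V^{sm}$ or $V^{\sigma}$ as the loss function, we prove the following:

Theorem 2.1 The $V_\gamma$ dimension $h$ for $\theta(1 - y f(\mathbf{x}))(1 - y f(\mathbf{x}))^{\sigma}$ in hypothesis spaces $\mathcal{F}_A = \{f \in \mathcal{H} \mid \|f\|_K^2 \le A^2\}$ (of the set of functions $\{\theta(1 - y f(\mathbf{x}))(1 - y f(\mathbf{x}))^{\sigma} \mid f \in \mathcal{F}_A\}$) and $y \in \{-1, 1\}$, is finite for $\forall\, 0 < \gamma$. If $D$ is the dimensionality of the RKHS $\mathcal{H}$, $R^2$ is the radius of the smallest sphere centered at the origin containing the data $\mathbf{x}$ in the RKHS, and $B \ge 1$ is an upper bound on the values of the loss function, then $h$ is upper bounded by:

• $O\left(\min\left(D, \frac{R^2 A^2}{\gamma^{2/\sigma}}\right)\right)$ for $\sigma < 1$

• $O\left(\min\left(D, \frac{\left(\sigma B^{\frac{\sigma-1}{\sigma}}\right)^2 R^2 A^2}{\gamma^2}\right)\right)$ for $\sigma \ge 1$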


Proof 

The proof is based on the following theorem [7] (proved for the fat-shattering dimension, but as mentioned above, we use it for the “equivalent” $V_\gamma$ one).

Theorem 2.2 [Gurvits, 1997] The $V_\gamma$ dimension $h$ of the set of functions² $\mathcal{F}_A = \{f \in \mathcal{H} \mid \|f\|_K^2 \le A^2\}$ is finite for $\forall\, \gamma > 0$. If $D$ is the dimensionality of the RKHS, then $h \le O\left(\min\left(D, \frac{R^2 A^2}{\gamma^2}\right)\right)$, where $R^2$ is the radius of the smallest sphere in the RKHS centered at the origin in which the data lie.


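Let $2N$ be the largest number of points $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{2N}, y_{2N})\}$ that can be shattered using the rules:

class 1 if $\theta(1 - y_i f(\mathbf{x}_i))(1 - y_i f(\mathbf{x}_i))^{\sigma} \ge s + \gamma$
class -1 if $\theta(1 - y_i f(\mathbf{x}_i))(1 - y_i f(\mathbf{x}_i))^{\sigma} \le s - \gamma$     (7)

for some $s$ with $0 < \gamma \le s$. After some simple algebra these rules can be decomposed as:

class 1 if $f(\mathbf{x}_i) - 1 \le -(s + \gamma)^{\frac{1}{\sigma}}$ (for $y_i = 1$)
        or $f(\mathbf{x}_i) + 1 \ge (s + \gamma)^{\frac{1}{\sigma}}$ (for $y_i = -1$)
class -1 if $f(\mathbf{x}_i) - 1 \ge -(s - \gamma)^{\frac{1}{\sigma}}$ (for $y_i = 1$)
        or $f(\mathbf{x}_i) + 1 \le (s - \gamma)^{\frac{1}{\sigma}}$ (for $y_i = -1$)     (8)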


From the $2N$ points at least $N$ have the same label: either all $y_i = -1$, or all $y_i = 1$. Consider the first case (the other case is exactly the same), and for simplicity of notation let’s assume the first $N$ points have $y_i = -1$. Since we can shatter the $2N$ points, we can also shatter the first $N$ points. Substituting $y_i$ with $-1$, we get that we can shatter the $N$ points $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ using rules:

class 1 if $f(\mathbf{x}_i) + 1 \ge (s + \gamma)^{\frac{1}{\sigma}}$
class -1 if $f(\mathbf{x}_i) + 1 \le (s - \gamma)^{\frac{1}{\sigma}}$     (9)

² As mentioned above, in this case we can consider $V(y, f(\mathbf{x})) = f(\mathbf{x})$.



Notice that the function $f(\mathbf{x}_i) + 1$ has RKHS norm bounded by $A^2$ plus a constant $C$ (equal to the inverse of the eigenvalue corresponding to the constant basis function in the RKHS; if the RKHS does not include the constant functions, we can define a new RKHS with the constant and use the new RKHS norm). Furthermore there is a “margin” between $(s + \gamma)^{\frac{1}{\sigma}}$ and $(s - \gamma)^{\frac{1}{\sigma}}$ which we can lower bound as follows.

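For $\sigma < 1$, assuming $\frac{1}{\sigma}$ is an integer (if not, we can take the closest lower integer),

$$\frac{1}{2}\left((s+\gamma)^{\frac{1}{\sigma}} - (s-\gamma)^{\frac{1}{\sigma}}\right) = \frac{1}{2}\big((s+\gamma) - (s-\gamma)\big) \left(\sum_{k=0}^{\frac{1}{\sigma}-1} (s+\gamma)^{\frac{1}{\sigma}-1-k} (s-\gamma)^{k}\right) \ge \gamma\,\gamma^{\frac{1}{\sigma}-1} = \gamma^{\frac{1}{\sigma}}. \qquad (10)$$
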


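For $\sigma \ge 1$, $\sigma$ integer (if not, we can take the closest upper integer) we have that:

$$2\gamma = \left((s+\gamma)^{\frac{1}{\sigma}}\right)^{\sigma} - \left((s-\gamma)^{\frac{1}{\sigma}}\right)^{\sigma} = \left((s+\gamma)^{\frac{1}{\sigma}} - (s-\gamma)^{\frac{1}{\sigma}}\right) \left(\sum_{k=0}^{\sigma-1} \left((s+\gamma)^{\frac{1}{\sigma}}\right)^{\sigma-1-k} \left((s-\gamma)^{\frac{1}{\sigma}}\right)^{k}\right) \le \left((s+\gamma)^{\frac{1}{\sigma}} - (s-\gamma)^{\frac{1}{\sigma}}\right) \sigma B^{\frac{\sigma-1}{\sigma}}$$

from which we obtain:

$$\frac{1}{2}\left((s+\gamma)^{\frac{1}{\sigma}} - (s-\gamma)^{\frac{1}{\sigma}}\right) \ge \frac{\gamma}{\sigma B^{\frac{\sigma-1}{\sigma}}}. \qquad (11)$$
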

Therefore $N$ cannot be larger than the $V_\gamma$ dimension of the set of functions with RKHS norm $\le A^2 + C$ and margin at least $\gamma^{\frac{1}{\sigma}}$ for $\sigma < 1$ (from eq. (10)) and $\frac{\gamma}{\sigma B^{\frac{\sigma-1}{\sigma}}}$ for $\sigma \ge 1$ (from eq. (11)).

Using theorem 2.2, and ignoring constant factors (also ones because of $C$), the theorem is proved.

□ 
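
The two margin lower bounds used in the proof can be checked numerically. The sketch below is an illustration, not part of the proof: the grid of values, the choice $B = 4$, and the function names are assumptions, and only values with $\gamma \le s$ and $s + \gamma \le B$ (as in the proof) are tried.

```python
def half_gap(s, gamma, sigma):
    """Half the distance between (s+gamma)^(1/sigma) and (s-gamma)^(1/sigma)."""
    return 0.5 * ((s + gamma) ** (1.0 / sigma) - (s - gamma) ** (1.0 / sigma))

def margin_lower_bound(gamma, sigma, B):
    """Right-hand side of (10) for sigma < 1 and of (11) for sigma >= 1."""
    if sigma < 1:
        return gamma ** (1.0 / sigma)
    return gamma / (sigma * B ** ((sigma - 1.0) / sigma))

def inequalities_hold(B=4.0, tol=1e-12):
    """Check (10) and (11) on a small grid with gamma <= s and s + gamma <= B."""
    for sigma in (0.25, 0.5, 1.0, 2.0, 3.0):   # 1/sigma or sigma is an integer
        for gamma in (0.1, 0.5, 0.9):
            for s in (gamma, 1.0, 2.0, B - gamma):
                if half_gap(s, gamma, sigma) < margin_lower_bound(gamma, sigma, B) - tol:
                    return False
    return True

print(inequalities_hold())
```

For $\sigma = 1$ both sides equal $\gamma$, so the bound is attained in the soft margin case.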


In figure 2 we plot the $V_\gamma$ dimension for $R^2 A^2 = 1$, $B = 1$, $\gamma = 0.9$, and $D$ infinite. Notice that as $\sigma \to 0$, the dimension goes to infinity. For $\sigma = 0$ the $V_\gamma$ dimension becomes the same as the VC dimension of hyperplanes, which is infinite in this case. For $\sigma$ increasing above 1, the dimension also increases: intuitively the margin $\gamma$ becomes smaller relative to the values of the loss function.
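
The qualitative behaviour described above can be reproduced from the expressions in theorem 2.1. The following sketch evaluates those expressions with all constant factors ignored, so the numbers are orders of magnitude rather than exact dimensions; the function name and default arguments are illustrative assumptions.

```python
import math

def v_gamma_bound(sigma, gamma, R2A2=1.0, B=1.0, D=math.inf):
    """Order-of-magnitude bound of theorem 2.1 (constant factors ignored)."""
    if sigma < 1:
        capacity = R2A2 / gamma ** (2.0 / sigma)
    else:
        capacity = (sigma * B ** ((sigma - 1.0) / sigma)) ** 2 * R2A2 / gamma ** 2
    return min(D, capacity)

# Same setting as the plot: R^2 A^2 = 1, B = 1, gamma = 0.9, D infinite.
for sigma in (0.1, 0.25, 0.5, 1.0, 2.0, 3.0):
    print(f"sigma={sigma:<4} bound ~ {v_gamma_bound(sigma, gamma=0.9):.3f}")
```

The value is smallest at $\sigma = 1$ and grows both as $\sigma \to 0$ and as $\sigma$ increases above 1.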

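Using theorems 2.1 and 1.1 we can bound the expected error of the solution $f$ of machines (1):

$$\Pr\left\{ \left| R^V_{emp}(f) - R^V(f) \right| > \epsilon \right\} \le \mathcal{G}(\epsilon, m, h^{\mathcal{F}}_{\gamma}), \qquad (12)$$

where $V$ is $V^{sm}$ or $V^{\sigma}$. To get a bound on the expected misclassification error $R^{msc}(f)$ we use the following simple observation:

$$V^{msc}(y, f(\mathbf{x})) \le V^{\sigma}(y, f(\mathbf{x})) \quad \text{for } \forall\, \sigma, \qquad (13)$$
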
Figure 2: Plot of the $V_\gamma$ dimension as a function of $\sigma$ for $\gamma = 0.9$.

So we can bound the expected misclassification error of the solution of machine (1) under $V^{sm}$ and $V^{\sigma}$ using the $V_\gamma$ dimension of these loss functions and the empirical error of $f$ measured using again these loss functions. In particular we get that for $\forall\, \sigma$, with probability $1 - \mathcal{G}(\epsilon, m, h^{\mathcal{F}}_{\gamma})$:

$$R^{msc}(f) \le R^{\sigma}_{emp}(f) + \epsilon \qquad (14)$$

where $\epsilon$ and $\gamma$ are related as stated in theorem 1.1.

Unfortunately we cannot use theorems 2.1 and 1.1 for the $V^{hm}$ loss function. For this loss function, since it is a binary-valued function, the $V_\gamma$ dimension is the same as the VC-dimension, which, as mentioned above, is not appropriate to use in our case. Notice, however, that for $\sigma \to 0$, $V^{\sigma}$ approaches $V^{hm}$ pointwise (from theorem 2.1 the $V_\gamma$ dimension also increases towards infinity). Regarding the empirical error, this implies that $R^{\sigma}_{emp} \to R^{hm}_{emp}$, so, theoretically, we can still bound the misclassification error of the solution of machines with $V^{hm}$ using:

$$R^{msc}(f) \le R^{hm}_{emp}(f) + \epsilon + \max\left(R^{\sigma}_{emp}(f) - R^{hm}_{emp}(f), 0\right), \qquad (15)$$

where $R^{\sigma}_{emp}(f)$ is measured using $V^{\sigma}$ for some $\sigma$. Notice that changing $\sigma$ we get a family of bounds on the expected misclassification error. Finally, we remark that it could be interesting to extend theorem 2.1 to loss functions of the form $\theta(1 - y f(\mathbf{x}))\, h(1 - y f(\mathbf{x}))$, with $h$ any continuous monotone function.
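
As an illustration of how the data-dependent part of bound (15) is assembled, the sketch below computes the empirical errors from the values $y_i f(\mathbf{x}_i)$ on a training set. The sample values are made up, the convention $\theta(0) = 0$ is assumed, and $\epsilon$ is passed in by the caller rather than derived from $\mathcal{G}$.

```python
def empirical_risk(loss, margins):
    """Average loss over the training set; margins[i] = y_i * f(x_i)."""
    return sum(loss(z) for z in margins) / len(margins)

def v_msc(z):
    return 1.0 if z < 0 else 0.0           # theta(-y f(x)), theta(0) = 0 assumed

def v_hm(z):
    return 1.0 if z < 1 else 0.0           # theta(1 - y f(x))

def v_sigma(z, sigma):
    return (1.0 - z) ** sigma if z < 1 else 0.0

def hard_margin_bound(margins, sigma, eps):
    """Right-hand side of (15); eps is supplied by the caller, not computed."""
    r_hm = empirical_risk(v_hm, margins)
    r_sigma = empirical_risk(lambda z: v_sigma(z, sigma), margins)
    return r_hm + eps + max(r_sigma - r_hm, 0.0)

margins = [2.0, 1.5, 0.4, -0.3, 1.0, -1.2]   # made-up values of y_i f(x_i)
for sigma in (0.5, 1.0, 2.0):
    print(sigma, hard_margin_bound(margins, sigma, eps=0.0))
```

Each choice of $\sigma$ gives a different value, which is the family of bounds mentioned above; the empirical misclassification error never exceeds the empirical error under $V^{\sigma}$, in agreement with (13).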

3 Discussion 

In recent years there has been significant work on bounding the generalization performance of classifiers using scale-sensitive dimensions of real-valued functions out of which indicator functions can be generated through thresholding (see [4, 9, 8], [3] and references therein). This is unlike the “standard” statistical learning theory approach where classification is typically studied using the theory of indicator functions (binary valued functions) and their VC-dimension [10]. The work presented in this paper is similar in spirit to that of [3], but significantly different as we now briefly discuss.

In [3] a theory was developed to justify machines with “margin”. The idea was that a “better” bound on the generalization error of a classifier can be derived by excluding training examples on which the hypothesis found takes a value close to zero (as mentioned above, classification is performed after thresholding a real valued function). Instead of measuring the empirical misclassification error, as suggested by the standard statistical learning theory, what was used was the number of misclassified training points plus the number of training points on which the hypothesis takes a value close to zero. Only points classified correctly with some “margin” are considered correct. In [3] a different notation was used: the parameter $A$ in equation (1) was fixed to 1, while a margin $\psi$ was introduced inside the hard margin loss, i.e. $\theta(\psi - y f(\mathbf{x}))$. Notice that the two notations are equivalent: given a value $A$ in our notation we have $\psi = A^{-1}$ in the notation of [3]. Below we adapt the results in [3] to the setup of this paper, that is, we set $\psi = 1$ and let $A$ vary. Two main theorems were proven in [3].

Theorem 3.1 (Bartlett, 1998) For a given $A$, with probability $1 - \delta$, every function $f$ with $\|f\|_K^2 \le A^2$ has expected misclassification error $R^{msc}(f)$ bounded as:

$$R^{msc}(f) < R^{hm}_{emp}(f) + \sqrt{\frac{2}{m}\left(d \ln(34em/d) \log_2(578m) + \ln(4/\delta)\right)}, \qquad (16)$$

where $d$ is the fat-shattering dimension $fat_\gamma$ of the hypothesis space $\{f : \|f\|_K^2 \le A^2\}$ for $\gamma = \frac{1}{16A}$.

Unlike in this paper, in [3] this theorem was proved without using theorem 1.1. Although neither bound (16) nor the bounds derived above are tight, and therefore none of them is practical, bound (16) seems easier to use than the ones presented in this paper.
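
To give a feeling for how loose bound (16) is, the sketch below evaluates only its square-root term. The values of $d$, $m$ and $\delta$ are arbitrary illustrative choices, not values taken from [3].

```python
import math

def bartlett_term(d, m, delta):
    """Square-root term of bound (16) for fat-shattering dimension d,
    sample size m and confidence parameter delta."""
    return math.sqrt(
        (2.0 / m) * (d * math.log(34 * math.e * m / d) * math.log2(578 * m)
                     + math.log(4.0 / delta))
    )

for m in (10**3, 10**5, 10**7):
    print(m, bartlett_term(d=10, m=m, delta=0.05))
```

For $d = 10$ and $m = 1000$ the term already exceeds 1, so the bound on a misclassification probability is vacuous there; it only becomes informative for much larger samples.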

It is important to notice that, like bounds (12), (14), and (15), theorem 3.1 holds for a fixed $A$ [3]. In [3] theorem 3.1 was extended to the case where the parameter $A$ (or $\psi$ in the notation of [3]) is not fixed, which means that the bound holds for all functions in the RKHS. In particular the following theorem gives a bound on the expected misclassification error of a machine that holds uniformly over all functions:

Theorem 3.2 (Bartlett, 1998) For any $f$ with $\|f\|_K < \infty$, with probability $1 - \delta$, the misclassification error $R^{msc}(f)$ of $f$ is bounded as:

$$R^{msc}(f) < R^{hm}_{emp}(f) + \sqrt{\frac{2}{m}\left(d \ln(34em/d) \log_2(578m) + \ln(8\|f\|_K/\delta)\right)}, \qquad (17)$$

where $d$ is the fat-shattering dimension $fat_\gamma$ of the hypothesis space consisting of all functions in the RKHS with norm at most $\|f\|_K$, and with $\gamma = \frac{1}{32\|f\|_K}$.

Notice that the only differences between (16) and (17) are the $\ln(8\|f\|_K/\delta)$ instead of $\ln(4/\delta)$, and that $\gamma = \frac{1}{32\|f\|_K}$ instead of $\gamma = \frac{1}{16A}$.

So far we studied machines of the form (1), where $A$ is fixed a priori. In practice, the learning machines used, like SVM, do not have $A$ fixed a priori. For example in the case of SVM the problem is formulated [10] as minimizing:

$$\min_{f} \; \sum_{i=1}^{m} \theta(1 - y_i f(\mathbf{x}_i))(1 - y_i f(\mathbf{x}_i)) + \lambda \|f\|_K^2 \qquad (18)$$

where $\lambda$ is known as the regularization parameter. In the case of machines (18) we do not know the norm of the solution $\|f\|_K^2$ before actually solving the optimization problem, so it is not clear what the “effective” $A$ is. Since we do not have a fixed upper bound on the norm $\|f\|_K^2$ a priori, we cannot use the bounds of section 2 or theorem 3.1 for machines of the form (18). Instead, we need to use bounds that hold uniformly for all $A$ (or $\psi$ if we follow the setup of [3]), for example the bound of theorem 3.2, so that the bound also holds for the solution of (18) we find. In fact theorem 3.2 has been used directly to get bounds on the performance of SVM [4]. A straightforward application of the methods used to extend theorem 3.1 to 3.2 can also be used to extend the bounds of section 2 to the case where $A$ is not fixed (and therefore hold for all $f$ with $\|f\|_K < \infty$), and we leave this as an exercise.
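
The dependence of the “effective” $A$ on $\lambda$ can be seen on a toy problem. The sketch below minimizes (18) for a linear $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$ with plain subgradient descent and reports $\|\mathbf{w}\|^2$ afterwards. The solver, its step size and iteration count, the linear kernel, and the four training points are all assumptions made for illustration; this is not how SVMs are trained in practice.

```python
def solve_regularized(data, lam, steps=2000, lr=0.01):
    """Plain subgradient descent on objective (18) for a linear f(x) = w . x.

    The solver, step size and iteration count are illustrative assumptions.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [2.0 * lam * wj for wj in w]          # gradient of lam * ||w||^2
        for x, y in data:
            if y * sum(wj * xj for wj, xj in zip(w, x)) < 1:
                for j in range(dim):                 # subgradient of the slack
                    grad[j] -= y * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def squared_norm(w):
    return sum(wj * wj for wj in w)

# Made-up one-dimensional training set.
data = [((1.0,), 1), ((-1.0,), -1), ((0.5,), 1), ((-0.5,), -1)]
for lam in (0.1, 1.0, 10.0):
    w = solve_regularized(data, lam)
    print(f"lambda={lam:<5} effective A^2 = ||f||_K^2 ~ {squared_norm(w):.3f}")
```

The squared norm of the solution shrinks as $\lambda$ grows and is only known after the problem has been solved, which is why a bound for a fixed $A$ cannot be applied directly.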

There is another way to see the similarity between machines (1) and (18). Notice that in formulation (1) the regularization parameter $\lambda$ of (18) can be seen as the Lagrange multiplier used to solve the constrained optimization problem (1). That is, problem (1) is equivalent to:

$$\max_{\lambda} \; \min_{f} \; \sum_{i=1}^{m} V(y_i, f(\mathbf{x}_i)) + \lambda\left(\|f\|_K^2 - A^2\right) \qquad (19)$$

for $\lambda \ge 0$, which is similar to problem (18) that is solved in practice. However in the case of (19) the Lagrange multiplier $\lambda$ is not known before having the training data, unlike in the case of (18).

So, to summarize, for the machines (1) studied in this paper, $A$ is fixed a priori and the “regularization parameter” $\lambda$ is not known a priori, while for machines (18) the parameter $\lambda$ is known a priori, but the norm of the solution (or the effective $A$) is not known a priori. As a consequence we can use the theorems of this paper for machines (1) but not for (18). To do the latter we need a technical extension of the results of section 2 similar to the extension of theorem 3.1 to 3.2 done in [3]. On the practical side, the important issue for both machines (1) and (18) is how to choose $A$ or $\lambda$. We believe that the theorems and bounds discussed in sections 2 and 3 cannot be practically used for this purpose. Criteria for the choice of the regularization parameter exist in the literature, such as cross validation and generalized cross validation (for example see [10, 11], [6] and references therein), and this is a topic of ongoing research. Finally, as our results indicate, the generalization performance of the learning machines can be bounded using any function of the slack variables and therefore of the margin distribution. Is it, however, the case that the slack variables (margin distributions or any functions of these) are the quantities that control the generalization performance of the machines, or are there other important geometric quantities involved? Our results suggest that there are many quantities related to the generalization performance of the machines, but it is not clear that these are the most important ones.

Acknowledgments We wish to thank Peter Bartlett for useful comments.


References 

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. of the ACM, 44(4):615-631, 1997.

[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[3] P. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 1998.

[4] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machine and other pattern classifiers. In C. Burges and B. Scholkopf, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[5] T. Evgeniou and M. Pontil. On the V-gamma dimension for regression in reproducing kernel Hilbert spaces. A.I. Memo, MIT Artificial Intelligence Lab., 1999.

[6] T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo No. 1654, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999.

[7] L. Gurvits. A note on scale-sensitive dimension of linear bounded functionals in Banach spaces. In Proceedings of Algorithmic Learning Theory, 1997.

[8] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998. To appear. Also: NeuroCOLT Technical Report NC-TR-96-053, 1996, ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech_reports.

[9] J. Shawe-Taylor and N. Cristianini. Robust bounds on generalization from the margin distribution. NeuroCOLT2 Technical Report NC2-TR-1998-029, NeuroCOLT2, 1998.

[10] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[11] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.

[12] R. Williamson, A. Smola, and B. Scholkopf. Generalization performance of regularization networks and support vector machines via entropy numbers. Technical Report NC-TR-98-019, Royal Holloway College University of London, 1998.





